Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition

Authors

  • Christopher S. Funk
  • K. Bretonnel Cohen
  • Lawrence Hunter
  • Karin M. Verspoor
Abstract

BACKGROUND Gene Ontology (GO) terms are the standard for annotating and representing molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and the way they are expressed in natural language text. Highly specific GO terms are constructed formulaically from parts of simpler terms.

RESULTS We present two types of manually generated rules that help capture the variation in how GO terms appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes terms into their smallest constituent parts. The second set of rules generates derivational variants of these smaller terms and compositionally combines all generated variants to re-form the original term. Applying both types of rules generates new synonyms for two-thirds of all GO terms and raises the F-measure for recognition of GO concepts on the CRAFT corpus from 0.498 to 0.636. Additionally, we evaluated the combination of both types of rules over one million full-text documents from Elsevier; manual validation and error analysis show that we are able to recognize GO concepts with reasonable accuracy (88%), based on random sampling of annotations.

CONCLUSIONS In this work we present a set of simple synonym generation rules that exploit the highly compositional and formulaic nature of Gene Ontology concepts. We illustrate how the generated synonyms improve recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.
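The two rule types described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' rule set: the prefix patterns, the variant table, and the function names `decompose` and `generate_synonyms` are hypothetical and chosen only to show the idea of recursive decomposition followed by compositional recombination of variants.

```python
from itertools import product

# Hypothetical prefix patterns for illustration only (not the paper's rules).
# Longer, more specific prefixes are listed before shorter ones.
PREFIX_PATTERNS = [
    "positive regulation of ",
    "negative regulation of ",
    "regulation of ",
    "response to ",
]

# Hypothetical variant table: each constituent maps to surface forms
# that might appear in text. The original form is always included.
VARIANTS = {
    "positive regulation of": ["positive regulation of", "stimulation of"],
    "cell migration": ["cell migration", "migration of cells"],
}


def decompose(term):
    """Recursively split a term into its constituent parts."""
    for prefix in PREFIX_PATTERNS:
        if term.startswith(prefix):
            # Peel off the matched prefix and decompose the remainder.
            return [prefix.strip()] + decompose(term[len(prefix):])
    return [term]  # no known prefix: treat the term as atomic


def generate_synonyms(term):
    """Combine every variant of every constituent back into full phrases."""
    parts = decompose(term)
    options = [VARIANTS.get(part, [part]) for part in parts]
    return {" ".join(combo) for combo in product(*options)}


if __name__ == "__main__":
    for synonym in sorted(generate_synonyms("positive regulation of cell migration")):
        print(synonym)
```

Because every constituent's variants are multiplied together, the number of candidate synonyms grows quickly with term length, which is one way the overgeneration issue mentioned in the abstract can arise.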


Similar resources

Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora

Concept recognition tools rely on the availability of textual corpora to assess their performance and enable the identification of areas for improvement. Typically, corpora are developed for specific purposes, such as gene name recognition. Gene and protein name identification are longstanding goals of biomedical text mining, and therefore a number of different corpora exist. However, phenotype...


Biomedical Semantics in the Big Data Era

1. Doing-Harris K, Livnat Y, Meystre S. Automated concept and relationship extraction for the Semi-Automated Ontology Management (SEAM) System. Journal of Biomedical Semantics 2015, 6:15. doi:10.1186/s13326-015-0011-7. Keywords: Ontology; Natural language processing; Terminology extraction. Background: We develop medical-specialty specific ontologies that contain the settled science and common term usage. We ...


What's in a 'nym'? Synonyms in Biomedical Ontology Matching

To bring the Life Sciences domain closer to a Semantic Web realization it is fundamental to establish meaningful relations between biomedical ontologies. The successful application of ontology matching techniques is strongly tied to an effective exploration of the complex and diverse biomedical terminology contained in biomedical ontologies. In this paper, we present an overview of the lexical ...


Comprehensive Benchmark of Gene Ontology Concept Recognition tools

The Gene Ontology has evolved into the de facto standard for describing gene function in the biomedical domain. Information about gene function can often be found in written articles. In this work we evaluate three tools capable of recognizing Gene Ontology concepts in text on an automatically generated gold standard of 88,573 articles. The analysis reveals differences in concept recognition for ...


Automated concept and relationship extraction for the semi-automated ontology management (SEAM) system

BACKGROUND We develop medical-specialty specific ontologies that contain the settled science and common term usage. We leverage current practices in information and relationship extraction to streamline the ontology development process. Our system combines different text types with information and relationship extraction techniques in a low overhead modifiable system. Our SEmi-Automated ontolog...



Journal title:

Volume 7, Issue

Pages -

Publication date: 2016